Abstract. We study the succinctness of the complement and intersection of regular
expressions. In particular, we show that when constructing a regular expression defining the
complement of a given regular expression, a double exponential size increase cannot be
avoided. Similarly, when constructing a regular expression defining the intersection of a
fixed and an arbitrary number of regular expressions, an exponential and double
exponential size increase, respectively, cannot be avoided in the worst case. All mentioned lower
bounds improve the existing ones by one exponential and are tight in the sense that the
target expression can be constructed in the corresponding time class, i.e., exponential or
double exponential time. As a by-product, we generalize a theorem by Ehrenfeucht and
Zeiger stating that there is a class of DFAs which are exponentially more succinct than
regular expressions, to a fixed four-letter alphabet. When the given regular expressions
are one-unambiguous, as for instance required by the XML Schema specification, the
complement can be computed in polynomial time whereas the bounds concerning intersection
continue to hold. For the subclass of single-occurrence regular expressions, we prove a
tight exponential lower bound for intersection.



1. Introduction 

The two central questions addressed in this paper are the following. Given regular
expressions $r, r_1, \ldots, r_k$ over an alphabet $\Sigma$,

(1) what is the complexity of constructing a regular expression defining $\Sigma^* \setminus L(r)$,
that is, the complement of $r$?

(2) what is the complexity of constructing a regular expression $r_\cap$ defining
$L(r_1) \cap \cdots \cap L(r_k)$?

In both cases, the naive algorithm takes time double exponential in the size of the input.
Indeed, for the complement, transform $r$ to an NFA and determinize it (first exponential
step), complement it and translate back to a regular expression (second exponential step).
For the intersection there is a similar algorithm through a translation to NFAs, taking
the crossproduct and a retranslation to a regular expression. Note that both algorithms
not only take double exponential time but also result in a regular expression of double
exponential size. In this paper, we exhibit classes of regular expressions for which this double
exponential size increase cannot be avoided. Furthermore, when the number $k$ of regular
expressions is fixed, $r_\cap$ can be constructed in exponential time and we prove a matching
lower bound for the size increase. In addition, we consider the fragments of one-unambiguous
and single-occurrence regular expressions relevant to XML schema languages [21 [3l [131 [IS] • 
Our main results are summarized in Table 1.
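The crossproduct step of the naive intersection algorithm can be sketched as follows. This is a minimal illustration, not the paper's construction: the dictionary layout of an NFA and the names `intersect_nfas`, `accepts`, `even_a` and `ends_b` are assumptions made for the example, and epsilon transitions are not supported.

```python
from itertools import product


def intersect_nfas(a, b):
    """Product construction: the result accepts exactly L(a) ∩ L(b).

    An NFA is a dict with keys 'states', 'start', 'finals' and 'delta',
    where 'delta' is a set of (state, symbol, state) triples.
    """
    delta = {
        ((p, q), x, (p2, q2))
        for (p, x, p2) in a["delta"]
        for (q, y, q2) in b["delta"]
        if x == y
    }
    return {
        "states": set(product(a["states"], b["states"])),
        "start": (a["start"], b["start"]),
        "finals": set(product(a["finals"], b["finals"])),
        "delta": delta,
    }


def accepts(nfa, word):
    """Simulate the NFA on word by tracking the set of reachable states."""
    current = {nfa["start"]}
    for x in word:
        current = {q2 for (q, y, q2) in nfa["delta"] if q in current and y == x}
    return bool(current & nfa["finals"])


# Hypothetical inputs: an even number of a's, and strings ending in b.
even_a = {"states": {0, 1}, "start": 0, "finals": {0},
          "delta": {(0, "a", 1), (1, "a", 0), (0, "b", 0), (1, "b", 1)}}
ends_b = {"states": {0, 1}, "start": 0, "finals": {1},
          "delta": {(0, "a", 0), (0, "b", 1), (1, "a", 0), (1, "b", 1)}}
both = intersect_nfas(even_a, ends_b)
```

The product of k automata with at most m+1 states each has at most (m+1)^k states, which is where the exponential dependence on k comes from.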

The main technical part of the paper is centered around the generalization of a result
by Ehrenfeucht and Zeiger [8]. They exhibit a class of languages $(Z_n)_{n \in \mathbb{N}}$ each of which can
be accepted by a DFA of size $O(n^2)$ but cannot be defined by a regular expression of size
smaller than $2^{n-1}$. The most direct way to define $Z_n$ is by the DFA that accepts it: the
DFA is a graph consisting of $n$ states, labeled $0$ to $n-1$, which are fully connected and
the edge between state $i$ and $j$ carries the label $a_{i,j}$. It now accepts all paths in the graph,
that is, all strings of the form $a_{i_0,i_1} a_{i_1,i_2} \cdots a_{i_{k-1},i_k}$. Note that the alphabet over which $Z_n$ is
defined grows quadratically with $n$. We generalize their result to a four-letter alphabet. In
particular, we define $K_n$ as the binary encoding of $Z_n$ using a suitable encoding for $a_{i,j}$ and
prove that every regular expression defining $K_n$ should be at least of size $2^n$. As integers are
encoded in binary the complement and intersection of regular expressions can now be used
to separately encode $K_{2^n}$ (and slight variations thereof) leading to the desired results. In [9]
the same generalization as obtained here is attributed to Waizenegger [35]. Unfortunately,
we believe that proof to be incorrect as we discuss in the full version of this paper.

Although the succinctness of various automata models has been investigated in depth [14]
and more recently that of logics over (unary alphabet) strings [15], the succinctness of
regular expressions has hardly been addressed. For the complement of a regular expression an
exponential lower bound is given by Ellul et al. [9]. For the intersection of an arbitrary
number of regular expressions Petersen gave an exponential lower bound [28], while Ellul et al. [9]
mention a quadratic lower bound for the intersection of two regular expressions. In fact,
in [9], it is explicitly asked what the maximum achievable blow-up is for the complement
of one and the intersection of two regular expressions (Open Problems 4 and 5). Although
we do not answer these questions in the most precise way, our lower bounds improve the
existing ones by one exponential and are tight in the sense that the target expression can
be constructed in the time class matching the space complexity of the lower bounds.

Succinctness of complement and intersection relates to the succinctness of semi-extended
($\mathrm{RE}(\cap)$) and extended regular expressions ($\mathrm{RE}(\cap,\neg)$). These are regular expressions
augmented with intersection and both complement and intersection operators, respectively.
Their membership problem has been extensively studied [18, 20, 26, 28, 30]. Furthermore,
non-emptiness and equivalence of $\mathrm{RE}(\cap,\neg)$ is non-elementary [33]. For $\mathrm{RE}(\cap)$,
inequivalence is EXPSPACE-complete [10, 16, 29], and non-emptiness is PSPACE-complete [10, 16] even
when restricted to the intersection of a (non-constant) number of regular expressions [19].
Several of these papers hint at the succinctness of the intersection operator and provide
dedicated techniques for dealing with the new operator directly rather than through a
translation to ordinary regular expressions [20, 28]. Our results present a double exponential
lower bound in translating $\mathrm{RE}(\cap)$ to RE and therefore further justify the development
of specialized techniques.

A final motivation for this research stems from its application in the emerging area of
XML-theory [21, 27, 31, 34]. From a formal language viewpoint, XML documents can be
seen as labeled unranked trees and collections of these documents are defined by schemas. A
schema can take various forms, but the most common ones are Document Type Definitions
(DTDs) [1] and XML Schema Definitions (XSDs) [32] which are grammar-based formalisms

                       complement   intersection (fixed)   intersection (arbitrary)
  regular expression   2-exp        exp                    2-exp
  one-unambiguous      poly         exp                    2-exp
  single-occurrence    poly         exp                    exp

Table 1: Overview of the size increase for the various operators and subclasses. All
non-polynomial complexities are tight.

with regular expressions at right-hand sides of rules [23, 25]. Many questions concerning
schemas reduce to corresponding questions on the classes of regular expressions used as 
right-hand sides of rules as is exemplified for the basic decision problems studied in 
and [22]. Furthermore, the lower bounds presented here are utilized in [12] to prove, among 
other things, lower bounds on the succinctness of existential and universal pattern-based 
schemas on the one hand, and single-type EDTDs (a formalization of XSDs) and DTDs, 
on the other hand. As the DTD and XML Schema specifications require regular expressions
occurring in rules to be deterministic, formalized by Brüggemann-Klein and Wood in
terms of one-unambiguous regular expressions [6], we also investigate the complement and
intersection of those. In particular, we show that a one-unambiguous regular expression
can be complemented in polynomial time, whereas the lower bounds concerning intersection
carry over from unrestricted regular expressions. A study in [2] reveals that most of the
one-unambiguous regular expressions used in practice take a very simple form: every
alphabet symbol occurs at most once. We refer to those as single-occurrence regular expressions
(SOREs) and show a tight exponential lower bound for intersection.

Outline. In Section 2, we introduce the necessary notions concerning (one-unambiguous)
regular expressions and automata. In Section 3, we extend the result by Ehrenfeucht and
Zeiger to a fixed alphabet using the family of languages $(K_n)_{n \in \mathbb{N}}$. In Section 4, we consider
the succinctness of complement. In Section 5, we consider the succinctness of intersection
of several classes of regular expressions. We conclude in Section 6. A version of this paper
containing all proofs is available from the authors' webpages.

2. Preliminaries 

2.1. Regular expressions 

By $\mathbb{N}$ we denote the natural numbers without zero. For the rest of the paper, $\Sigma$ always
denotes a finite alphabet. A $\Sigma$-string (or simply string) is a finite sequence $w = a_1 \cdots a_n$
of $\Sigma$-symbols. We define the length of $w$, denoted by $|w|$, to be $n$. We denote the empty
string by $\varepsilon$. The set of positions of $w$ is $\{1, \ldots, n\}$ and the symbol of $w$ at position $i$ is $a_i$.
By $w_1 \cdot w_2$ we denote the concatenation of two strings $w_1$ and $w_2$. As usual, for readability,
we denote the concatenation of $w_1$ and $w_2$ by $w_1 w_2$. The set of all strings is denoted
by $\Sigma^*$ and the set of all non-empty strings by $\Sigma^+$. A string language is a subset of $\Sigma^*$.
For two string languages $L, L' \subseteq \Sigma^*$, we define their concatenation $L \cdot L'$ to be the set
$\{w \cdot w' \mid w \in L, w' \in L'\}$. We abbreviate $L \cdot L \cdots L$ ($i$ times) by $L^i$.

The set of regular expressions over $\Sigma$, denoted by RE, is defined in the usual way: $\emptyset$,
$\varepsilon$, and every $\Sigma$-symbol is a regular expression; and when $r_1$ and $r_2$ are regular expressions,
then $r_1 \cdot r_2$, $r_1 + r_2$, and $r_1^*$ are also regular expressions.
By $\mathrm{RE}(\cap,\neg)$ we denote the class of extended regular expressions, that is, RE
extended with intersection and complementation operators. So, when $r_1$ and $r_2$ are
$\mathrm{RE}(\cap,\neg)$-expressions then so are $r_1 \cap r_2$ and $\neg r_1$. By $\mathrm{RE}(\cap)$ and $\mathrm{RE}(\neg)$ we denote RE extended
solely with the intersection and complement operator, respectively.

The language defined by an extended regular expression $r$, denoted by $L(r)$, is
inductively defined as follows: $L(\emptyset) = \emptyset$; $L(\varepsilon) = \{\varepsilon\}$; $L(a) = \{a\}$; $L(r_1 r_2) = L(r_1) \cdot L(r_2)$;
$L(r_1 + r_2) = L(r_1) \cup L(r_2)$; $L(r^*) = \{\varepsilon\} \cup \bigcup_{i=1}^{\infty} L(r)^i$; $L(r_1 \cap r_2) = L(r_1) \cap L(r_2)$; and
$L(\neg r_1) = \Sigma^* \setminus L(r_1)$.

By $\bigcup_{i=1}^{k} r_i$ and $r^k$, with $k \in \mathbb{N}$, we abbreviate the expressions $r_1 + \cdots + r_k$ and $r r \cdots r$
($k$ times), respectively. For a set $S = \{a_1, \ldots, a_n\} \subseteq \Sigma$, we abbreviate by $S$ the regular
expression $a_1 + \cdots + a_n$.

We define the size of an extended regular expression $r$ over $\Sigma$, denoted by $|r|$, as
the number of $\Sigma$-symbols and operators occurring in $r$ disregarding parentheses. This is
equivalent to the length of its (parenthesis-free) reverse Polish form [37]. Formally, $|\emptyset| =
|\varepsilon| = |a| = 1$, for $a \in \Sigma$, $|r_1 r_2| = |r_1 \cap r_2| = |r_1 + r_2| = |r_1| + |r_2| + 1$, and $|\neg r| = |r^*| = |r| + 1$.
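The size measure can be computed by a direct recursion over the syntax tree. The sketch below assumes a hypothetical tuple encoding of expressions (`("sym", "a")`, `("eps",)`, `("empty",)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("and", r1, r2)`, `("star", r)`, `("not", r)`); the encoding and the function name `size` are illustrative choices, not notation from the paper.

```python
def size(r):
    """Number of symbols and operators in r, i.e. the length of its
    parenthesis-free reverse Polish form."""
    op = r[0]
    if op in ("empty", "eps", "sym"):
        return 1
    if op in ("cat", "alt", "and"):
        return size(r[1]) + size(r[2]) + 1
    if op in ("star", "not"):
        return size(r[1]) + 1
    raise ValueError(f"unknown operator: {op}")


a, b = ("sym", "a"), ("sym", "b")
# (a + b)* a has reverse Polish form  a b + * a .  (six tokens)
example = ("cat", ("star", ("alt", a, b)), a)
# (not eps)(not eps)(not eps) contains no alphabet symbol at all
not_eps = ("not", ("eps",))
three_not_eps = ("cat", ("cat", not_eps, not_eps), not_eps)
```

Here `size(example)` is 6 and `size(three_not_eps)` is 8, although the second expression contains no alphabet symbols.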

Other possibilities considered in the literature for defining the size of a regular expres- 
sion are: (1) counting all symbols, operators, and parentheses (HIT]; or, (2) counting only 
the $\Sigma$-symbols. However, Ellul et al. [9] have shown that for regular expressions (so,
without $\neg$ and $\cap$), provided they are preprocessed by syntactically eliminating superfluous $\emptyset$-
and $\varepsilon$-symbols, and nested stars, the three length measures are identical up to a constant
multiplicative factor. For extended regular expressions, counting only the $\Sigma$-symbols is not
sufficient, since for instance the expression $(\neg\varepsilon)(\neg\varepsilon)(\neg\varepsilon)$ does not contain any $\Sigma$-symbols.
Therefore, we define the size of an expression as the length of its reverse Polish form.

2.2. One-unambiguous regular expressions and SOREs 

As mentioned in the introduction, several XML schema languages restrict regular
expressions occurring in rules to be deterministic, formalized by Brüggemann-Klein and
Wood [6] in terms of one-unambiguity. We introduce this notion next.

To indicate different occurrences of the same symbol in a regular expression, we mark
symbols with subscripts. For instance, the marking of $(a + b)^*a + bc$ is $(a_1 + b_2)^*a_3 + b_4c_5$.
We denote by $r^\flat$ the marking of $r$ and by $\mathrm{Sym}(r^\flat)$ the subscripted symbols occurring in $r^\flat$.
When $r$ is a marked expression, then $r^\natural$ over $\Sigma$ is obtained from $r$ by dropping all subscripts.
This notion is extended to words and languages in the usual way.

Definition 2.1. A regular expression $r$ is one-unambiguous iff for all strings $w, u, v \in
\mathrm{Sym}(r^\flat)^*$, and all symbols $x, y \in \mathrm{Sym}(r^\flat)$, the conditions $uxv, uyw \in L(r^\flat)$ and $x \neq y$ imply
$x^\natural \neq y^\natural$.

For instance, the regular expression $r = a^*a$, with marking $r^\flat = a_1^*a_2$, is not
one-unambiguous. Indeed, the marked strings $a_1a_2$ and $a_1a_1a_2$, both in $L(r^\flat)$, do not satisfy
the conditions in the previous definition. The equivalent expression $aa^*$, however, is
one-unambiguous. The intuition behind the definition is that positions in the input string can
be matched in a deterministic way against a one-unambiguous regular expression without
looking ahead. For instance, for the expression $aa^*$, the first $a$ of an input string is always
matched against the leading $a$ in the expression, while every subsequent $a$ is matched against
the last $a$. Unfortunately, one-unambiguous regular languages do not form a very robust
class as they are not even closed under the Boolean operations [6].
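One-unambiguity can be tested by checking that the Glushkov (position) automaton of the expression is deterministic, which is the characterization due to Brüggemann-Klein and Wood [6]. The following sketch assumes a hypothetical tuple encoding of expressions (`("sym", "a")`, `("eps",)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("star", r)`) and assumes the expression contains no $\emptyset$; the function names are illustrative.

```python
from itertools import count


def mark(r, counter=None):
    """Return a copy of r in which every symbol carries a unique position."""
    counter = counter if counter is not None else count(1)
    if r[0] == "sym":
        return ("sym", (r[1], next(counter)))
    if r[0] == "eps":
        return r
    return (r[0],) + tuple(mark(child, counter) for child in r[1:])


def analyse(r, follow):
    """Return (nullable, first, last) of a marked expression, filling follow."""
    op = r[0]
    if op == "eps":
        return True, set(), set()
    if op == "sym":
        follow.setdefault(r[1], set())
        return False, {r[1]}, {r[1]}
    if op == "star":
        _, first, last = analyse(r[1], follow)
        for x in last:
            follow[x] |= first
        return True, first, last
    if op not in ("alt", "cat"):
        raise ValueError(f"unsupported operator: {op}")
    n1, f1, l1 = analyse(r[1], follow)
    n2, f2, l2 = analyse(r[2], follow)
    if op == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:  # concatenation: last positions of r1 are followed by first of r2
        follow[x] |= f2
    return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())


def is_one_unambiguous(r):
    follow = {}
    _, first, _ = analyse(mark(r), follow)

    def deterministic(positions):
        symbols = [sym for (sym, _) in positions]
        return len(symbols) == len(set(symbols))

    return deterministic(first) and all(deterministic(s) for s in follow.values())


a = ("sym", "a")
star_then_a = ("cat", ("star", a), a)   # a*a
a_then_star = ("cat", a, ("star", a))   # aa*
```

On the two expressions from the text, the check rejects `star_then_a` and accepts `a_then_star`.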
The following subclass captures the class of regular expressions occurring in XML
schemas on the Web [2]:

Definition 2.2. A single-occurrence regular expression (SORE) is a regular expression
where every alphabet symbol occurs at most once. In addition, we allow the operator $r^+$
which defines $rr^*$.

For instance, $(a + b)^+c$ is a SORE while $a^*(a + b)^+$ is not. Clearly, every SORE is
one-unambiguous. Note that SOREs define local languages and that over a fixed alphabet
there are only finitely many of them.
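The single-occurrence condition is purely syntactic, so it can be checked by counting symbol occurrences. The sketch assumes a hypothetical tuple encoding of expressions in which `("plus", r)` stands for $r^+$; the names `symbols` and `is_sore` are illustrative.

```python
from collections import Counter


def symbols(r):
    """Yield every alphabet symbol occurrence in r, left to right."""
    if r[0] == "sym":
        yield r[1]
    else:
        for child in r[1:]:
            yield from symbols(child)


def is_sore(r):
    """True iff every alphabet symbol occurs at most once in r."""
    return all(n == 1 for n in Counter(symbols(r)).values())


a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
sore = ("cat", ("plus", ("alt", a, b)), c)              # (a + b)+ c
not_sore = ("cat", ("star", a), ("plus", ("alt", a, b)))  # a* (a + b)+
```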

2.3. Finite automata 

A non-deterministic finite automaton (NFA) $A$ is a 4-tuple $(Q, q_0, \delta, F)$ where $Q$ is the
set of states, $q_0$ is the initial state, $F$ is the set of final states and $\delta \subseteq Q \times \Sigma \times Q$ is the
transition relation. We write $q \Rightarrow_{A,w} q'$ when $w$ takes $A$ from state $q$ to $q'$. So, $w$ is accepted
by $A$ if $q_0 \Rightarrow_{A,w} q'$ for some $q' \in F$. The set of strings accepted by $A$ is denoted by $L(A)$.
The size of an NFA is $|Q| + |\delta|$. An NFA is deterministic (or a DFA) if for all $a \in \Sigma$, $q \in Q$,
$|\{(q, a, q') \in \delta \mid q' \in Q\}| \le 1$.

We make use of the following known results. 

Theorem 2.3. Let $A_1, \ldots, A_m$ be NFAs over $\Sigma$ with $|A_i| = n_i$ for $i \le m$, and $|\Sigma| = k$.

(1) A regular expression $r$, with $L(r) = L(A_1)$, can be constructed in time $O(m_1 k 4^{m_1})$,
where $m_1$ is the number of states of $A_1$ [9].

(2) A DFA $B$ with $2^{m_1}$ states, such that $L(B) = L(A_1)$, can be constructed in time
$O(2^{n_1})$ [36].

(3) A DFA $B$ with $2^{m_1}$ states, such that $L(B) = \Sigma^* \setminus L(A_1)$, can be constructed in time
$O(2^{n_1})$ [36].

(4) Let $r \in \mathrm{RE}$. An NFA $B$ with $|r| + 1$ states, such that $L(B) = L(r)$, can be constructed
in time $O(|r| \cdot |\Sigma|)$ [5].

(5) Let $r \in \mathrm{RE}(\cap)$. An NFA $B$ with $2^{|r|}$ states, such that $L(B) = L(r)$, can be
constructed in time exponential in the size of $r$ [10].
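Items (2) and (3) rest on the classical subset construction followed by swapping final and non-final states. A minimal sketch, assuming a hypothetical dictionary layout for automata (keys `states`, `start`, `finals`, `delta`) and building only the reachable subsets; none of these names come from the paper.

```python
def determinize(nfa, alphabet):
    """Subset construction; returns a complete DFA over the given alphabet."""
    start = frozenset([nfa["start"]])
    delta, todo, seen = {}, [start], {start}
    while todo:
        subset = todo.pop()
        for x in alphabet:
            target = frozenset(q2 for (q, y, q2) in nfa["delta"]
                               if q in subset and y == x)
            delta[subset, x] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    finals = {s for s in seen if s & nfa["finals"]}
    return {"states": seen, "start": start, "finals": finals, "delta": delta}


def complement(dfa):
    """Complement a complete DFA by swapping final and non-final states."""
    return {**dfa, "finals": dfa["states"] - dfa["finals"]}


def dfa_accepts(dfa, word):
    state = dfa["start"]
    for x in word:
        state = dfa["delta"][state, x]
    return state in dfa["finals"]


# Hypothetical input: strings over {a, b} whose second-to-last symbol is a.
second_last_a = {
    "states": {0, 1, 2}, "start": 0, "finals": {2},
    "delta": {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "a", 2), (1, "b", 2)},
}
dfa = determinize(second_last_a, "ab")
co_dfa = complement(dfa)
```

Complementation is only sound on a complete DFA, which is why the sketch keeps the empty subset as an ordinary (sink) state whenever it is reachable.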

3. A generalization of a Theorem by Ehrenfeucht and Zeiger to a fixed 
alphabet 

We first introduce the family $(Z_n)_{n \in \mathbb{N}}$ defined by Ehrenfeucht and Zeiger over an
alphabet whose size grows quadratically with the parameter $n$ [8]:

Definition 3.1. Let $n \in \mathbb{N}$ and $\Sigma_n = \{a_{i,j} \mid 0 \le i, j \le n-1\}$. Then, $Z_n$ contains exactly
all strings of the form $a_{i_0,i_1} a_{i_1,i_2} \cdots a_{i_{k-1},i_k}$ where $k \in \mathbb{N}$.

A way to interpret $Z_n$ is to consider the DFA with states $\{0, \ldots, n-1\}$ which is fully
connected and where the edge between state $i$ and $j$ is labeled with $a_{i,j}$. The language $Z_n$
then consists of all paths in the DFA.

Ehrenfeucht and Zeiger obtained the succinctness of DFAs with respect to regular ex- 
pressions through the following theorem: 



Actually, in [8], only paths from state $0$ to state $n-1$ are considered. We use our slightly modified
definition as it will be easier to generalize to a fixed arity alphabet suited for our purpose in the sequel.

Theorem 3.2 ([8]). For $n \in \mathbb{N}$, any regular expression defining $Z_n$ must be of size at least
$2^{n-1}$. Furthermore, there is a DFA of size $O(n^2)$ accepting $Z_n$.

Our language $K_n$ is then the straightforward binary encoding of $Z_n$ that additionally
swaps the pair of indices in every symbol $a_{i,j}$. Thereto, for $a_{i,j} \in \Sigma_n$, define the function
$\rho_n$ as

\[ \rho_n(a_{i,j}) = \mathrm{enc}(j)\,\$\,\mathrm{enc}(i)\,\#, \]

where $\mathrm{enc}(i)$ and $\mathrm{enc}(j)$ denote the $\lceil \log(n) \rceil$-bit binary encodings of $i$ and $j$, respectively.
Note that since $i, j < n$, $i$ and $j$ can be encoded using only $\lceil \log(n) \rceil$ bits. We extend the
definition of $\rho_n$ to strings in the usual way: $\rho_n(a_{i_0,i_1} \cdots a_{i_{k-1},i_k}) = \rho_n(a_{i_0,i_1}) \cdots \rho_n(a_{i_{k-1},i_k})$.
We are now ready to define $K_n$.

Definition 3.3. Let $\Sigma_K = \{0, 1, \$, \#\}$. For $n \in \mathbb{N}$, let $K_n = \{\rho_n(w) \mid w \in Z_n\}$.

For instance, for $n = 5$, $w = a_{3,2}\,a_{2,1}\,a_{1,4}\,a_{4,2} \in Z_5$ and thus

\[ \rho_n(w) = 010\$011\#001\$010\#100\$001\#010\$100\# \in K_5. \]
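The encoding can be reproduced mechanically. The sketch below represents a symbol $a_{i,j}$ as the pair `(i, j)`; the function names `enc`, `rho` and `is_path` are illustrative, and the bit width is computed as the number of bits of $n-1$, which equals $\lceil \log_2 n \rceil$ for $n \ge 2$.

```python
def enc(i, n):
    """Binary encoding of i, padded to ceil(log2(n)) bits (assumes n >= 2)."""
    width = (n - 1).bit_length()
    return format(i, f"0{width}b")


def rho(path, n):
    """Encode a string of Z_n, given as a list of index pairs (i, j)."""
    return "".join(f"{enc(j, n)}${enc(i, n)}#" for (i, j) in path)


def is_path(path):
    """True iff the pairs form a non-empty path a_{i0,i1} a_{i1,i2} ..."""
    return len(path) >= 1 and all(
        path[t][1] == path[t + 1][0] for t in range(len(path) - 1)
    )


w = [(3, 2), (2, 1), (1, 4), (4, 2)]
encoded = rho(w, 5)
```

For the path above, `encoded` equals `010$011#001$010#100$001#010$100#`, the string from the example in the text.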
We generalize the previous theorem as follows: 

Theorem 3.4. For any $n \in \mathbb{N}$, with $n \ge 2$,

(1) any regular expression defining $K_n$ is of size at least $2^n$; and,

(2) there is a DFA $A_n$ of size $O(n^2 \log n)$ defining $K_n$.

The construction of $A_n$ is omitted. The rest of this section is devoted to the proof
of Theorem 3.4(1). It follows the structure of the proof of Ehrenfeucht and Zeiger but is
technically more involved as it deals with binary encodings of integers.

We start by introducing some terminology. Let $w = a_{i_0,i_1} a_{i_1,i_2} \cdots a_{i_{k-1},i_k} \in Z_n$. We say
that $i_0$ is the start-point of $w$ and $i_k$ is its end-point. Furthermore, we say that $w$ contains
$i$ or $i$ occurs in $w$ if $i$ occurs as an index of some symbol in $w$. That is, $a_{i,j}$ or $a_{j,i}$ occurs in
$w$ for some $j$. For instance, $a_{0,2}\,a_{2,2}\,a_{2,1} \in Z_5$ has start-point 0, end-point 1, and contains
0, 1 and 2. The notions of contains, occurs, start- and end-point of a string $w$ are also
extended to $K_n$. So, the start and end-points of $\rho_n(w)$ are the start and end-points of $w$,
and $w$ contains the same integers as $\rho_n(w)$.

For a regular expression $r$, we say that $i$ is a sidekick of $r$ when it occurs in every
non-empty string defined by $r$. A regular expression $s$ is a starred subexpression of a regular
expression $r$ when $s$ is a subexpression of $r$ and is of the form $t^*$.

Now, the following lemma holds: 

Lemma 3.5. Any starred subexpression $s$ of a regular expression $r$ defining $K_n$ has a
sidekick.

We now say that a regular expression $r$ is normal if every starred subexpression of $r$
has a sidekick. In particular, any expression defining $K_n$ is normal. We say that a regular
expression $r$ covers a string $w$ if there exist strings $u, u' \in \Sigma^*$ such that $uwu' \in L(r)$. If
there is a greatest integer $m$ for which $r$ covers $w^m$, we call $m$ the index of $w$ in $r$ and denote
it by $I_w(r)$. In this case we say that $r$ is $w$-finite. Otherwise, we say that $r$ is $w$-infinite.
The index of a regular expression can be used to give a lower bound on its size according to
the following lemma.

Lemma 3.6 ([8]). For any regular expression $r$ and string $w$, if $r$ is $w$-finite, then $I_w(r) <
2|r|$.

Now, we can state the most important property of $K_n$.

Lemma 3.7. Let $n \ge 2$. For any $C \subseteq \{0, \ldots, n-1\}$ of cardinality $k$ and $i \in C$, there
exists a string $w \in K_n$ with start- and end-point $i$ only containing integers in $C$, such that
any normal regular expression $r$ which covers $w$ is of size at least $2^k$.

Proof. The proof is by induction on the value of $k$. For $k = 1$, $C = \{i\}$. Then, define
$w = \mathrm{enc}(i)\,\$\,\mathrm{enc}(i)\,\#$, which satisfies all conditions, and any expression covering $w$ must
definitely have a size of at least 2.

For the inductive step, let $C = \{j_1, \ldots, j_k\}$. Define $C_\ell = C \setminus \{j_{(\ell \bmod k)+1}\}$ and let $w_\ell$
be the string given by the induction hypothesis with respect to $C_\ell$ (of size $k-1$) and $j_\ell$.
Note that $j_\ell \in C_\ell$. Further, define $m = 2^{k+1}$ and set

\[ w = \mathrm{enc}(j_1)\$\mathrm{enc}(i)\#\; w_1^m\; \mathrm{enc}(j_2)\$\mathrm{enc}(j_1)\#\; w_2^m\; \mathrm{enc}(j_3)\$\mathrm{enc}(j_2)\# \cdots w_k^m\; \mathrm{enc}(i)\$\mathrm{enc}(j_k)\#. \]

Then, $w \in K_n$, has $i$ as start and end-point and only contains integers in $C$. It only remains
to show that any expression $r$ which is normal and covers $w$ is of size at least $2^k$.
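The recursive shape of the witness string can be made concrete at the level of $Z_n$ symbols, before applying the binary encoding. The sketch below follows one reading of the construction: a symbol $a_{x,y}$ is the pair `(x, y)`, the list `C` plays the role of $j_1, \ldots, j_k$ in order, and $m = 2^{k+1}$. The function name `witness` is an assumption made for illustration, and the elements of `C` are assumed to be distinct.

```python
def witness(C, i):
    """Symbols of the Lemma 3.7 string for the set C (as a list) and i in C."""
    C = list(C)
    k = len(C)
    if k == 1:
        return [(i, i)]                      # base case: the single loop a_{i,i}
    m = 2 ** (k + 1)
    w, prev = [], i
    for l in range(k):                       # 0-based l, so C[l] plays j_{l+1}
        j = C[l]
        # C_l drops the cyclic successor of j, so j itself stays in C_l
        C_l = [c for c in C if c != C[(l + 1) % k]]
        w.append((prev, j))                  # step from the previous index to j
        w.extend(witness(C_l, j) * m)        # m repetitions of w_l, a loop on j
        prev = j
    w.append((prev, i))                      # final step back to i
    return w


small = witness([0, 1], 0)
medium = witness([0, 1, 2], 0)
```

With these definitions `small` has 19 symbols and `medium` has 916, which illustrates how quickly the witness grows with $k$.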

Fix such a regular expression $r$. If $r$ is $w_\ell$-finite for some $\ell \le k$, then $I_{w_\ell}(r) \ge m =
2^{k+1}$ by construction of $w$. By Lemma 3.6, $|r| \ge 2^k$ and we are done.

Therefore, assume that $r$ is $w_\ell$-infinite for every $\ell \le k$. For every $\ell \le k$, consider all
subexpressions of $r$ which are $w_\ell$-infinite. It is easy to see that all minimal elements in this
set of subexpressions must be starred subexpressions. Here and in the following, we say
that an expression is minimal with respect to a set simply when no other expression in the
set is a subexpression. Indeed, a subexpression of the form $a$ or $\varepsilon$ can never be $w_\ell$-infinite
and a subexpression of the form $r_1 r_2$ or $r_1 + r_2$ can only be $w_\ell$-infinite if $r_1$ and/or $r_2$
are $w_\ell$-infinite and is thus not minimal with respect to $w_\ell$-infinity. Among these minimal
starred subexpressions for $w_\ell$, choose one and denote it by $s_\ell$. Let $E = \{s_1, \ldots, s_k\}$. Note
that since $r$ is normal, all its subexpressions are also normal. As in addition each $s_\ell$ covers
$w_\ell$, by the induction hypothesis the size of each $s_\ell$ is at least $2^{k-1}$. Now, choose from $E$
some expression $s_\ell$ such that $s_\ell$ is minimal with respect to the other elements in $E$.

As $r$ is normal and $s_\ell$ is a starred subexpression of $r$, there is an integer $j$ such that
every non-empty string in $L(s_\ell)$ contains $j$. By definition of the strings $w_1, \ldots, w_k$, there is
some $w_p$, $p \le k$, such that $w_p$ does not contain $j$. Denote by $s_p$ the starred subexpression
from $E$ which is $w_p$-infinite. In particular, $s_\ell$ and $s_p$ cannot be the same subexpression of $r$.

Now, there are three possibilities: 

• $s_\ell$ and $s_p$ are completely disjoint subexpressions of $r$. That is, neither is
a subexpression of the other. By induction they must both be of size at least $2^{k-1}$ and
thus $|r| \ge 2^{k-1} + 2^{k-1} = 2^k$.

• $s_p$ is a strict subexpression of $s_\ell$. This is not possible since $s_\ell$ is chosen to be a
minimal element of $E$.

• $s_\ell$ is a strict subexpression of $s_p$. We show that if we replace $s_\ell$ by $\varepsilon$ in $s_p$, then $s_p$
is still $w_p$-infinite. It then follows that $s_p$ still covers $w_p$, and thus $s_p$ without $s_\ell$ is
of size at least $2^{k-1}$. As $|s_\ell| \ge 2^{k-1}$ as well it follows that $|r| \ge 2^k$.



To see that $s_p$ without $s_\ell$ is still $w_p$-infinite, recall that any non-empty string
defined by $s_\ell$ contains $j$ and $j$ does not occur in $w_p$. Therefore, a full iteration of $s_\ell$
can never contribute to the matching of any number of repetitions of $w_p$. So, $s_p$ can
only lose its $w_p$-infinity by this replacement if $s_\ell$ contains a subexpression which is
itself $w_p$-infinite. However, this then also is a subexpression of $s_p$ and $s_p$ is chosen
to be minimal with respect to $w_p$-infinity, a contradiction. We can only conclude
that $s_p$ without $s_\ell$ is still $w_p$-infinite.

■

In fact, in [8] the length of an expression is defined as the number of $\Sigma$-symbols occurring in it. However,
since our length measure also contains these $\Sigma$-symbols, this lemma still holds in our setting.

Since by Lemma 3.5 any expression defining $K_n$ is normal, Theorem 3.4(1) directly
follows from Lemma 3.7 by choosing $i = 0$, $k = n$. This concludes the proof of Theorem 3.4(1).

4. Complementing regular expressions 

It is known that extended regular expressions are non-elementary more succinct than 
classical ones (3133]. Intuitively, each exponent in the tower requires nesting of an additional 
complement. In this section, we show that in defining the complement of a single regular 
expression, a double-exponential size increase cannot be avoided in general. In contrast, 
when the expression is one-unambiguous its complement can be computed in polynomial 
time. 

Theorem 4.1. (1) For every regular expression $r$ over $\Sigma$, a regular expression $s$ with
$L(s) = \Sigma^* \setminus L(r)$ can be constructed in time $O(2^{|r|+1} \cdot |\Sigma| \cdot 4^{2^{|r|+1}})$.
(2) Let $\Sigma$ be a four-letter alphabet. For every $n \in \mathbb{N}$, there is a regular expression $r_n$
of size $O(n)$ such that any regular expression $r$ defining $\Sigma^* \setminus L(r_n)$ is of size at least
$2^{2^n}$.

Proof. (2) Take $\Sigma$ as $\Sigma_K$, that is, $\{0, 1, \$, \#\}$. Let $n \in \mathbb{N}$. We define an expression $r_n$ of size
$O(n)$, such that $\Sigma^* \setminus L(r_n) = K_{2^n}$. By Theorem 3.4, any regular expression defining $K_{2^n}$
is of size exponential in $2^n$, that is, of size $2^{2^n}$. By $r^{[0,n-1]}$ we abbreviate the expression
$(\varepsilon + r(\varepsilon + r(\varepsilon \cdots (\varepsilon + r))))$, with a nesting depth of $n-1$. We then define $r_n$ as the disjunction
of the following expressions:

• all strings that do not start with a prefix in $(0+1)^n\$$:

\[ \Sigma^{[0,n]} + (0+1)^{[0,n-1]}(\$ + \#)\Sigma^* + (0+1)^n(0 + 1 + \#)\Sigma^* \]

• all strings where a $\$$ is not followed by a string in $(0+1)^n\#$:

\[ \Sigma^*\$\big(\Sigma^{[0,n-1]}(\# + \$) + \Sigma^n(0 + 1 + \$)\big)\Sigma^* \]

• all strings where a non-final $\#$ is not followed by a string in $(0+1)^n\$$:

\[ \Sigma^*\#\big(\Sigma^{[0,n-1]}(\# + \$) + \Sigma^n(0 + 1 + \#)\big)\Sigma^* \]

• all strings that do not end in $\#$:

\[ \Sigma^*(0 + 1 + \$) \]

• all strings where the corresponding bits of corresponding blocks are different:

\[ ((0+1)^* + \Sigma^*\#(0+1)^*)\,0\,\Sigma^{3n+2}\,1\,\Sigma^* + ((0+1)^* + \Sigma^*\#(0+1)^*)\,1\,\Sigma^{3n+2}\,0\,\Sigma^*. \]

It should be clear that a string over $\{0, 1, \$, \#\}$ is matched by none of the above expressions
if and only if it belongs to $K_{2^n}$. So, the complement of $r_n$ defines exactly $K_{2^n}$. ■
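The claim that the five disjuncts together describe exactly the complement of $K_{2^n}$ can be sanity-checked by brute force for very small $n$. The sketch below transcribes one reading of the five expressions into Python regular expressions (in particular, the second and third disjuncts are written as "a separator occurs too early, or the symbol after $n$ positions is the wrong one") and compares them with a direct membership test for $K_{2^n}$. The names `disjuncts`, `in_K` and `agree` are illustrative and not from the paper.

```python
import re
from itertools import product

BIT, SYM = "[01]", "[01$#]"


def disjuncts(n):
    """The five disjuncts of r_n, transcribed into Python regex syntax."""
    gap = 3 * n + 2
    return [re.compile(p) for p in (
        # strings that do not start with a prefix in (0+1)^n $
        rf"{SYM}{{0,{n}}}|{BIT}{{0,{n - 1}}}[$#]{SYM}*|{BIT}{{{n}}}[01#]{SYM}*",
        # some $ is not followed by a string in (0+1)^n #
        rf"{SYM}*\$({SYM}{{0,{n - 1}}}[$#]|{SYM}{{{n}}}[01$]){SYM}*",
        # some non-final # is not followed by a string in (0+1)^n $
        rf"{SYM}*#({SYM}{{0,{n - 1}}}[$#]|{SYM}{{{n}}}[01#]){SYM}*",
        # strings that do not end in #
        rf"{SYM}*[01$]",
        # corresponding bits of corresponding blocks differ
        rf"({BIT}*|{SYM}*#{BIT}*)(0{SYM}{{{gap}}}1|1{SYM}{{{gap}}}0){SYM}*",
    )]


def in_K(s, n):
    """Direct test for membership in K_{2^n}, where indices are n bits wide."""
    block = 2 * n + 2
    if not s or len(s) % block:
        return False
    blocks = [s[i:i + block] for i in range(0, len(s), block)]
    shape = re.compile(rf"{BIT}{{{n}}}\${BIT}{{{n}}}#")
    if not all(shape.fullmatch(b) for b in blocks):
        return False
    # the first number of a block must reappear as the second number of the next
    return all(blocks[t][:n] == blocks[t + 1][n + 1:2 * n + 1]
               for t in range(len(blocks) - 1))


def agree(n, max_len):
    """True iff, for every string up to max_len, exactly one of
    'matched by some disjunct' and 'member of K_{2^n}' holds."""
    parts = disjuncts(n)
    for length in range(max_len + 1):
        for chars in product("01$#", repeat=length):
            s = "".join(chars)
            if any(p.fullmatch(s) for p in parts) == in_K(s, n):
                return False
    return True
```

Calling `agree(1, 8)` enumerates all strings of length at most 8, which covers every string of up to two blocks when $n = 1$.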
The previous theorem essentially shows that in complementing a regular expression,
there is no better algorithm than translating to a DFA, computing the complement and 
translating back to a regular expression which includes two exponential steps. However, 
when the given regular expression is one-unambiguous, a corresponding DFA can be com- 
puted in quadratic time through the Glushkov construction [6] eliminating already one 
exponential step. The proof of the next theorem shows that the complement of that DFA 
can be directly defined by a regular expression of polynomial size. 

Theorem 4.2. For any one-unambiguous regular expression $r$ over an alphabet $\Sigma$, a regular
expression $s$ defining $\Sigma^* \setminus L(r)$ can be constructed in time $O(n^3)$, where $n$ is the size of $r$.

Proof. Let $r$ be a one-unambiguous expression over $\Sigma$. We introduce some notation.

• The set Not-First($r$) contains all $\Sigma$-symbols which are not the first symbol in any
word defined by $r$, that is, $\text{Not-First}(r) = \Sigma \setminus \{a \mid a \in \Sigma \wedge \exists w \in \Sigma^*, aw \in L(r)\}$.

• For any symbol $x \in \mathrm{Sym}(r^\flat)$, the set Not-Follow($r, x$) contains all $\Sigma$-symbols of
which no marked version can follow $x$ in any word defined by $r^\flat$. That is, $\text{Not-Follow}(r, x) =
\Sigma \setminus \{y^\natural \mid y \in \mathrm{Sym}(r^\flat) \wedge \exists w, w' \in \mathrm{Sym}(r^\flat)^*, wxyw' \in L(r^\flat)\}$.

• The set Last($r$) contains all marked symbols which are the last symbol of some word
defined by $r^\flat$. Formally, $\text{Last}(r) = \{x \mid x \in \mathrm{Sym}(r^\flat) \wedge \exists w \in \mathrm{Sym}(r^\flat)^*, wx \in L(r^\flat)\}$.

We define the following regular expressions:

• $\text{init}(r) = \text{Not-First}(r)\Sigma^*$ if $\varepsilon \in L(r)$; and
  $\text{init}(r) = \varepsilon + \text{Not-First}(r)\Sigma^*$ if $\varepsilon \notin L(r)$.

• For every $x \in \mathrm{Sym}(r^\flat)$, let $r_x^\flat$ be the expression defining $\{wx \mid w \in \mathrm{Sym}(r^\flat)^* \wedge \exists u \in
\mathrm{Sym}(r^\flat)^*, wxu \in L(r^\flat)\}$. That is, all prefixes of strings in $r^\flat$ ending in $x$. Then, let
$r_x$ define $L(r_x^\flat)^\natural$.

We are now ready to define $s$:

\[ s = \text{init}(r) + \bigcup_{x \notin \text{Last}(r)} r_x(\varepsilon + \text{Not-Follow}(r, x)\Sigma^*) + \bigcup_{x \in \text{Last}(r)} r_x\,\text{Not-Follow}(r, x)\Sigma^*. \]

It can be shown that $s$ can be constructed in time cubic in the size of $r$ and that $s$ defines
the complement of $r$. The latter is proved by exhibiting a direct correspondence between $s$
and the complement of the Glushkov automaton of $r$. ■
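As an illustration of the construction, consider the one-unambiguous expression $r = aa^*$ from Section 2.2 over the assumed alphabet $\Sigma = \{a, b\}$. The following worked instance is a sketch under that assumption; it is not taken from the paper.

```latex
\begin{align*}
r^\flat &= a_1 a_2^*, \qquad \mathrm{Sym}(r^\flat) = \{a_1, a_2\}, \qquad \varepsilon \notin L(r),\\
\text{Not-First}(r) &= \{b\}, \qquad \text{init}(r) = \varepsilon + b\Sigma^*,\\
\text{Last}(r) &= \{a_1, a_2\},\\
\text{Not-Follow}(r, a_1) &= \text{Not-Follow}(r, a_2) = \{b\},\\
r_{a_1} &= a, \qquad r_{a_2} = a a^* a,\\
s &= \varepsilon + b\Sigma^* + a\,b\,\Sigma^* + a a^* a\,b\,\Sigma^*.
\end{align*}
```

Since every marked symbol is in Last($r$), the first union in the definition of $s$ is empty. The resulting $s$ accepts the empty string and every string containing a $b$, which is exactly $\Sigma^* \setminus L(aa^*)$.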

We conclude this section by remarking that one-unambiguous regular expressions are 
not closed under complement and that the constructed s is therefore not necessarily one- 
unambiguous. 

5. Intersecting regular expressions 

In this section, we study the succinctness of intersection. In particular, we show that 
the intersection of two (or any fixed number) and an arbitrary number of regular expres- 
sions are exponentially and double exponentially more succinct than regular expressions, 
respectively. Actually, the exponential bound for a fixed number of expressions already 
holds for single-occurrence regular expressions, whereas the double exponential bound for 
an arbitrary number of expressions only carries over to one-unambiguous expressions. For 
single-occurrence expressions this can again be done in exponential time. 

In this respect, we introduce a slightly altered version of $K_n$.
Definition 5.1. Let $\Sigma_L = \{0, 1, \$, \#, \triangle\}$. For all $n \in \mathbb{N}$, $L_n = \{\rho_n(w)\triangle \mid w \in Z_n \wedge |w|$
is even$\}$.

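We also define a variant of $Z_n$ which only slightly alters the $a_{i,j}$ symbols in $Z_n$. Thereto,

let $\Sigma_n^\circ = \{a_{i^\circ,j}, a_{i,j^\circ} \mid 0 \le i, j < n\}$ and set $\rho(a_{i,j}a_{j,k}) = \triangleright_i\, a_{i,j^\circ}\, a_{j^\circ,k}$ and
$\rho(a_{i_0,i_1}a_{i_1,i_2} \cdots a_{i_{k-2},i_{k-1}}a_{i_{k-1},i_k}) = \rho(a_{i_0,i_1}a_{i_1,i_2}) \cdots \rho(a_{i_{k-2},i_{k-1}}a_{i_{k-1},i_k})$, where $k$ is even.

Definition 5.2. Let $n \in \mathbb{N}$ and $\Sigma_n^M = \Sigma_n^\circ \cup \{\triangleright_0, \triangle_0, \ldots, \triangleright_{n-1}, \triangle_{n-1}\}$. Then, $M_n =
\{\rho(w)\triangle_i \mid w \in Z_n \wedge |w|$ is even $\wedge\ i$ is the end-point of $w\}$.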

Note that paths in $M_n$ are those in $Z_n$ where every odd position is promoted to a circled
one ($^\circ$), and triangles labeled with the non-circled positions are added. For instance, the
string $a_{2,4}\,a_{4,3}\,a_{3,3}\,a_{3,0} \in Z_5$ is mapped to the string $\triangleright_2\, a_{2,4^\circ}\, a_{4^\circ,3}\, \triangleright_3\, a_{3,3^\circ}\, a_{3^\circ,0}\, \triangle_0 \in M_5$.
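The mapping from $Z_n$ to $M_n$ can be sketched on token lists. A symbol $a_{i,j}$ is again the pair `(i, j)`, and each output symbol is rendered as a short string; the function name `to_M` and the textual rendering of the symbols are assumptions made for illustration.

```python
def to_M(path):
    """Map an even-length path of Z_n to the token list of its M_n string."""
    if not path or len(path) % 2:
        raise ValueError("path must be non-empty and of even length")
    out = []
    # consume the path two symbols at a time: a_{i,j} a_{j,k}
    for (i, j), (j2, k) in zip(path[0::2], path[1::2]):
        if j != j2:
            raise ValueError("not a path")
        out += [f"▷{i}", f"a{i},{j}°", f"a{j}°,{k}"]
    out.append(f"△{path[-1][1]}")   # closing triangle carries the end-point
    return out


tokens = to_M([(2, 4), (4, 3), (3, 3), (3, 0)])
```

For the example path from the text this yields the seven tokens `▷2`, `a2,4°`, `a4°,3`, `▷3`, `a3,3°`, `a3°,0`, `△0`.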

We make use of the following property: 

Lemma 5.3. Let $n \in \mathbb{N}$.

(1) Any regular expression defining $L_n$ is of size at least $2^n$.

(2) Any regular expression defining $M_n$ is of size at least $2^{n-1}$.

The next theorem shows the succinctness of the intersection operator. 

Theorem 5.4. (1) For any $k \in \mathbb{N}$ and regular expressions $r_1, \ldots, r_k$, a regular
expression defining $\bigcap_{i \le k} L(r_i)$ can be constructed in time $O((m+1)^k \cdot |\Sigma| \cdot 4^{(m+1)^k})$, where
$m = \max\{|r_i| \mid 1 \le i \le k\}$.

(2) For every $n \in \mathbb{N}$, there are SOREs $r_n$ and $s_n$ of size $O(n^2)$ such that any regular
expression defining $L(r_n) \cap L(s_n)$ is of size at least $2^{n-1}$.

(3) For each $r \in \mathrm{RE}(\cap)$ an equivalent regular expression can be constructed in time
$O(2^{|r|} \cdot |\Sigma| \cdot 4^{2^{|r|}})$.

(4) For every $n \in \mathbb{N}$, there are one-unambiguous regular expressions $r_1, \ldots, r_m$, with
$m = 2n + 1$, of size $O(n)$ such that any regular expression defining $\bigcap_{i \le m} L(r_i)$ is of
size at least $2^{2^n}$.

(5) Let $r_1, \ldots, r_n$ be SOREs. A regular expression defining $\bigcap_{i \le n} L(r_i)$ can be
constructed in time $O(m \cdot |\Sigma| \cdot 4^m)$, where $m = \sum_{i \le n} |r_i|$.

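Proof. (2) Let $n \in \mathbb{N}$. By Lemma 5.3(2), any regular expression defining $M_n$ is of size at
least $2^{n-1}$. We define SOREs $r_n$ and $s_n$ of size quadratic in $n$, such that $L(r_n) \cap L(s_n) = M_n$.
We start by partitioning $\Sigma_n^\circ$ in two different ways. To this end, for every $i < n$, define
$\mathrm{Out}_i = \{a_{i,j^\circ} \mid 0 \le j < n\}$, $\mathrm{In}_i = \{a_{j^\circ,i} \mid 0 \le j < n\}$, $\mathrm{Out}_{i^\circ} = \{a_{i^\circ,j} \mid 0 \le j < n\}$, and
$\mathrm{In}_{i^\circ} = \{a_{j,i^\circ} \mid 0 \le j < n\}$. Then,

\[ \Sigma_n^\circ = \bigcup_{i} (\mathrm{In}_i \cup \mathrm{Out}_i) = \bigcup_{i^\circ} (\mathrm{In}_{i^\circ} \cup \mathrm{Out}_{i^\circ}). \]

Further, define

\[ r_n = \Big( (\triangleright_0 + \cdots + \triangleright_{n-1}) \bigcup_{i^\circ} \mathrm{In}_{i^\circ}\,\mathrm{Out}_{i^\circ} \Big)^+ (\triangle_0 + \cdots + \triangle_{n-1}) \]

and

\[ s_n = \Big( \bigcup_{i} (\mathrm{In}_i + \varepsilon)(\triangleright_i + \triangle_i)(\mathrm{Out}_i + \varepsilon) \Big)^+. \]

Now, $r_n$ checks that every string consists of a sequence of blocks of the form $\triangleright_i\, a_{j,k^\circ}\, a_{k^\circ,\ell}$
for $i, j, k, \ell < n$, ending with a $\triangle_i$, for $i < n$. It thus sets the format of the strings and
checks whether the circled indices are equal. Further, $s_n$ checks whether the non-circled
indices are equal and whether the triangles have the correct indices. Since the alphabet of
$M_n$ is of size $O(n^2)$, also $r_n$ and $s_n$ are of size $O(n^2)$.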

(4) Let $n \in \mathbb{N}$. We define $m = 2n + 1$ one-unambiguous regular expressions of size $O(n)$,
such that their intersection defines $L_{2^n}$. By Lemma 5.3(1), any regular expression defining
$L_{2^n}$ is of size at least $2^{2^n}$ and the theorem follows. For ease of readability, we denote $\Sigma_L$
simply by $\Sigma$. The expressions are as follows. There should be an even length sequence of
blocks:

\[ \big( (0+1)^n \$ (0+1)^n \# (0+1)^n \$ (0+1)^n \# \big)^* \triangle. \]

For all $i \in \{0, \ldots, n-1\}$, the $(i+1)$th bit of the two numbers surrounding an odd $\#$ should
be equal:

\[ \big( \Sigma^i (0\Sigma^{3n+2}0 + 1\Sigma^{3n+2}1) \Sigma^{n-i-1} \# \big)^* \triangle. \]

For all $i \in \{0, \ldots, n-1\}$, the $(i+1)$th bit of the two numbers surrounding an even $\#$ should
be equal:

Clearly, the intersection of the above expressions defines $L_{2^n}$. Furthermore, every expression
is of size $O(n)$ and is one-unambiguous as the Glushkov construction translates them into
a DFA [6]. ■

6. Conclusion 

In this paper we showed that the complement and intersection of regular expressions 
are double exponentially more succinct than ordinary regular expressions. For comple- 
ment, complexity can be reduced to polynomial for the class of one-unambiguous regular 
expressions although the obtained expressions could fall outside that class. For intersection, 
restriction to SOREs reduces complexity to exponential. It remains open whether there are 
natural classes of regular expressions for which both the complement and intersection can 
be computed in polynomial time. 